Add EPYC CPU serving skill (vLLM + zentorch)#76

Merged
shailensobhee merged 3 commits into amd:main from amd-lalithnc:add-serving-llms-on-epyc on Jun 30, 2026

@amd-lalithnc

Contributor

What

Adds serving-llms-on-epyc: a skill that brings up a single vLLM OpenAI endpoint on an
AMD EPYC CPU host with the zentorch backend, in a container (Docker/Podman) or a conda env.

Flow

  1. Detect the CPU: vendor, EPYC generation + Zen arch, AVX-512, physical cores, NUMA, RAM (detect.py).
  2. Validate the environment (validate.py): container runtime (docker/podman) or conda fallback;
    image present and, if already pulled, importing vllm and zentorch succeeds inside it; host perf
    libraries (tcmalloc / OpenMP via LD_PRELOAD); HF_TOKEN; RAM.
  3. Resolve + check the model (check_model.py): confirm vLLM supports the architecture via its
    model registry (text or multimodal); reject pooling / non-LLM models, which are not chat endpoints.
    Gated models require HF_TOKEN + license acceptance.
  4. Check RAM fit (estimate_memory.py): weights + KV cache + headroom ≤ host RAM.
  5. Size the runtime from the hardware (cpu_tune.py): bind to socket 0's physical cores and
    set VLLM_CPU_KVCACHE_SPACE; no memory binding by default (NPS2/NPS4 get a perf note).
  6. Confirm: present a sized plan and wait for the user to confirm before launching.
  7. Launch: vllm serve (never --device cpu on vLLM ≥ 0.20).
  8. Verify + hand over: poll /health, validate the /v1/chat/completions endpoint, then print a
    connection table.
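
The RAM-fit rule in step 4 (weights + KV cache + headroom ≤ host RAM) can be illustrated with a small worked sketch. This is not the contents of estimate_memory.py; the function names, the 2-bytes-per-parameter bf16 assumption, and the example numbers are hypothetical and only show the shape of the arithmetic.

```python
GIB = 1024 ** 3


def estimate_weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GiB.

    Assumption: 2 bytes per parameter, i.e. bf16 weights with no quantization.
    """
    return num_params * bytes_per_param / GIB


def fits_in_ram(weights_gib: float, kv_cache_gib: float,
                headroom_gib: float, host_ram_gib: float) -> bool:
    """Apply the step-4 rule: weights + KV cache + headroom <= host RAM."""
    return weights_gib + kv_cache_gib + headroom_gib <= host_ram_gib


# Hypothetical example: an 8B-parameter model on a 64 GiB host.
weights = estimate_weights_gib(8e9)          # roughly 14.9 GiB in bf16
ok = fits_in_ram(weights, kv_cache_gib=40, headroom_gib=8, host_ram_gib=64)
too_big = fits_in_ram(weights, kv_cache_gib=48, headroom_gib=8, host_ram_gib=64)
```

With these illustrative numbers the first plan fits (about 62.9 GiB of 64 GiB) and the second does not, which is the point at which a skill like this would stop and report rather than launch.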

Single instance. On any failure it reports the cause + logs and stops: no retry, no debugging loop.
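
A minimal sketch of the "poll /health, then stop on failure" behavior from step 8, assuming the server listens on a base URL such as http://localhost:8000. The function names, the deadline, and the polling interval are hypothetical; the skill's actual verification logic is not shown in this PR description.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_for_health(base_url: str,
                    deadline_s: float = 600.0,
                    interval_s: float = 2.0,
                    probe: Callable[[str], bool] = http_ok,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll <base_url>/health until it is healthy or the deadline passes.

    Returns False on timeout so the caller can report the cause and stop,
    rather than relaunching or entering a debugging loop.
    """
    end = clock() + deadline_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if clock() >= end:
            return False
        sleep(interval_s)
```

The probe, sleep, and clock parameters are injectable only so the loop can be exercised without a running server.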

Contents

  • SKILL.md, reference.md, skill-card.md, data/epyc.json
  • scripts: detect.py, validate.py, check_model.py, estimate_memory.py, cpu_tune.py
  • behavioral eval: eval/behavioral/tests/test_serving_llms_on_epyc.py
  • registered in .claude-plugin/marketplace.json (+ regenerated Cursor manifest)

Notes / scope

  • Uses the amdih/zendnn_zentorch image on Docker Hub.
  • KV cache is bf16-only on zentorch CPU; TORCHINDUCTOR_FREEZING=1 requires VLLM_USE_AOT_COMPILE=0.
  • OMP_NUM_THREADS and VLLM_CPU_NUM_OF_RESERVED_CPU are intentionally left unset — vLLM derives
    them (from the bind list / its own default).
  • NUMA default: socket 0's physical cores, no memory binding.
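
The environment rules in these notes could be captured roughly as follows. This is a sketch under assumptions, not the skill's code: build_cpu_env is a hypothetical helper, and passing the bind list through VLLM_CPU_OMP_THREADS_BIND is an assumption about how the "bind list" mentioned above reaches vLLM.

```python
def build_cpu_env(kv_cache_gib: int, bind_cpus: str,
                  freezing: bool = True) -> dict[str, str]:
    """Build the environment for a zentorch-backed vLLM CPU launch.

    OMP_NUM_THREADS and VLLM_CPU_NUM_OF_RESERVED_CPU are deliberately
    not set here, so vLLM can derive them itself.
    """
    env = {
        # KV cache budget in GiB.
        "VLLM_CPU_KVCACHE_SPACE": str(kv_cache_gib),
        # Assumption: the bind list is handed to vLLM via this variable.
        "VLLM_CPU_OMP_THREADS_BIND": bind_cpus,
    }
    if freezing:
        # Per the note above, freezing requires AOT compile to be disabled.
        env["TORCHINDUCTOR_FREEZING"] = "1"
        env["VLLM_USE_AOT_COMPILE"] = "0"
    return env


# Hypothetical values: 40 GiB of KV cache, cores 0-63 on the chosen socket.
env = build_cpu_env(kv_cache_gib=40, bind_cpus="0-63")
```

Keeping the two freezing-related variables in one branch makes it hard to set one without the other.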

Testing

  • Structural gate (check.sh): passes (0 errors).
  • Behavioral eval (LLM-judged, sonnet): 13/13 — exercises detect → validate → check_model →
    estimate → cpu_tune → confirm, plus the guardrails. Live launch/serve is the manual /
    integration tier on a real EPYC host.

Change-Id: I1dc2362e0983326658b6618015a161ecd44f40e6

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
@danielholanda danielholanda requested a review from Mahdi-CV June 25, 2026 22:24
@danielholanda

Collaborator

@Mahdi-CV Can you help review this?

@amd-lalithnc

amd-lalithnc commented Jun 30, 2026

Contributor Author

Hi @danielholanda @Mahdi-CV @shailensobhee, can we move ahead with review and CI for this PR? Thanks!

@shailensobhee

shailensobhee commented Jun 30, 2026

Member

Hi @amd-lalithnc, a few things so far. Having benchmarked ZenDNN 6.0 recently, I noticed a versioning issue that you may want to clarify in the skill itself:
It looks like vLLM 0.23.0 can be used only in a conda env with the zentorch 2.11 wheel file. With the Docker/Podman route, it is vLLM 0.22.0, which is the latest version listed at https://hub.docker.com/r/amdih/zendnn_zentorch.

Do you agree with this observation? If so, we may need to clarify this in the SKILL file and associated documentation.

  1. On a dual-socket machine, how do you direct the skill to use only socket 1 (as opposed to socket 0)? What if the system's socket 0 is already busy? With this skill, it appears that we will force socket 0 even if socket 1 is idle.

  2. You seem to size the KV cache against the whole system's RAM, but since you are binding to socket 0, you may want to bind memory too. Otherwise you could hit severe performance penalties when accessing KV-cache data allocated in memory attached to socket 1.

Conclusion: We can merge this skill, but there are potential performance aspects to narrow down. Thoughts?

cc: @Vkathail

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: I6442cc19df3caa3e0e5f36cc276bf94550d5a95e
@amd-lalithnc

Contributor Author

Hi @shailensobhee, thanks for your thoughts!

  1. vLLM 0.23.0 is in the validation phase, with an expected release date of 8 July. The currently supported open-source version is 0.22.0.

  2. I have added logic that defaults to socket 0; if it is busy, the skill moves to the other socket if available; if both are busy, it proceeds with socket 0 and a warning.

  3. We have added limited memory binding, confining memory to what is available per socket. When a socket has multiple NUMA nodes, a warning message is displayed.
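
As a rough illustration of point 3, memory binding could be expressed as a numactl prefix on the conda path and as cpuset flags on the container path. The helper names and the example CPU list are hypothetical, the model name is a placeholder, and the sketch assumes one NUMA node per socket (NPS1) so that socket index and NUMA node index coincide.

```python
def conda_launch_prefix(numa_node: int) -> list[str]:
    """numactl prefix that pins both CPU scheduling and memory to one node."""
    return ["numactl", f"--cpunodebind={numa_node}", f"--membind={numa_node}"]


def container_binding_flags(cpu_list: str, numa_node: int) -> list[str]:
    """docker/podman run flags restricting CPUs and memory nodes."""
    return [f"--cpuset-cpus={cpu_list}", f"--cpuset-mems={numa_node}"]


# Hypothetical example: serve from socket 1 on a dual-socket host.
conda_cmd = conda_launch_prefix(1) + ["vllm", "serve", "org/model-name"]
container_flags = container_binding_flags("64-127", 1)
```

With more than one NUMA node per socket (NPS2/NPS4) a single node index no longer covers the whole socket, which is the case where the warning mentioned above applies.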

Let me know if the changes are suitable. Thanks!

@shailensobhee shailensobhee left a comment

Member

Approving. All three points I raised earlier have been addressed in the code, verified against the head commit:

  1. vLLM version clarity - data/epyc.json pins vllm_version: 0.22.0 with the matching public container tag (amdih/zendnn_zentorch:vllm_v0.22.0_zentorch_v2.11.0.1_...). Pinning the public stable 0.22.0 is correct while 0.23.0 is still in validation. The 0.23 / zentorch 2.11 TORCHINDUCTOR_FREEZING crash gotcha is documented.
  2. Dual-socket selection - cpu_tune.py now samples per-socket load from /proc/stat, prefers a free socket, falls back to the least-busy one with a warning when both are busy, and supports --socket N to force.
  3. KV-cache locality - memory is now bound to the chosen socket (numactl --cpunodebind/--membind for conda, --cpuset-mems for containers) and KV cache is sized from that socket's local RAM, not whole-system RAM. NPS2/NPS4 multi-node cases emit a note.
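
The per-socket load sampling described in point 2 might look roughly like the sketch below. It is not the code in cpu_tune.py: the function names, the 25% busy threshold, and the CPU-to-socket mapping passed in by the caller are all assumptions made for illustration. Load is computed from the difference between two /proc/stat snapshots.

```python
from typing import Optional


def parse_proc_stat(text: str) -> dict[int, tuple[int, int]]:
    """Return {cpu_id: (busy_jiffies, total_jiffies)} from /proc/stat text."""
    out: dict[int, tuple[int, int]] = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep per-CPU lines ("cpu0", "cpu1", ...); skip the aggregate "cpu" line.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # user, nice, system, idle, iowait, irq, softirq, steal
        # (guest time is already counted inside user, so it is left out).
        fields = [int(v) for v in parts[1:9]]
        total = sum(fields)
        idle = fields[3] + fields[4]  # idle + iowait
        out[int(parts[0][3:])] = (total - idle, total)
    return out


def socket_load(before: dict, after: dict,
                cpu_to_socket: dict[int, int]) -> dict[int, float]:
    """Busy fraction per socket between two parsed snapshots."""
    busy: dict[int, int] = {}
    total: dict[int, int] = {}
    for cpu, sock in cpu_to_socket.items():
        if cpu not in before or cpu not in after:
            continue
        busy[sock] = busy.get(sock, 0) + after[cpu][0] - before[cpu][0]
        total[sock] = total.get(sock, 0) + after[cpu][1] - before[cpu][1]
    return {s: (busy[s] / total[s] if total[s] else 0.0) for s in total}


def pick_socket(load: dict[int, float],
                busy_threshold: float = 0.25) -> tuple[int, Optional[str]]:
    """Prefer a free socket; otherwise take the least busy one with a warning."""
    free = [s for s in sorted(load) if load[s] < busy_threshold]
    if free:
        return free[0], None
    least = min(sorted(load), key=lambda s: load[s])
    return least, f"all sockets busy; using least-busy socket {least}"
```

In practice the two snapshots would be read from /proc/stat a short interval apart, and an explicit --socket N would bypass pick_socket entirely.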

Note on CI: the behavioral checks are red due to a CI infra issue, not the skill. The eval harness fails at setup in conftest.py because the runner's claude judge CLI is not authenticated (Not logged in / Please run /login), so zero behavioral assertions actually executed. This is expected for a fork PR where Actions secrets are withheld. All substantive checks pass: skill validation, manifest validation, SkillSpector security scan, and external-reference checks. Recommend a maintainer with CI-secret access re-run the behavioral job (or run it from an in-repo branch) to get a clean green before merge.

@shailensobhee shailensobhee merged commit a8fb081 into amd:main Jun 30, 2026
14 of 17 checks passed